{janitor}

Caroline Kostrzewa, Sabrina Lin

Topic Tuesdays: 2024-10-01

FYI

  • Using three data sets which we have “messed up” in various ways (all from Kaggle)
    • Species
    • Characters
    • Films


  • chisq and fisher test function (talk about masking)
    – put screenshot of warning – put actual function names – add stuff about masking

Species

id name classification designation µ_height skin_colors hAiR_cOlOrS eye_colors µ_lifespan language hOmEwOrLd
1 Human Mammal Sentient 1.80 Light, Dark Various Various 79 Galactic Basic Various
2 Yoda's species Unknown Sentient 0.66 Green White Brown 900 Galactic Basic Unknown
3 Wookiee Mammal Sentient 2.28 Brown Brown Blue 400 Shyriiwook Kashyyyk
4 Gungan Amphibian Sentient 1.96 Orange None Orange 70 Gungan Naboo
5 Twi'lek Mammal Sentient 1.80 Blue, Green, Red, Yellow None Various 80 Twi'leki Ryloth
6 Mon Calamari Amphibian Sentient 1.70 Orange None Red 70 Mon Calamarian Mon Cala
7 Rodian Reptilian Sentient 1.70 Green None Black 78 Rodese Rodia
8 Droid Artificial Non-sentient
Various None Various
Binary None
9 Trandoshan Reptilian Sentient 2.00 Green None Red 70 Dosh Trandosha
10 Kel Dor Mammal Sentient 1.88 Orange None Black 70 Kel Dor Dorin

Characters

id name species gender height weight hair_color eye_color skin_color year_born homeworld year_died description
1 Luke Skywalker Human Male 1.72 77 Blond Blue Light 19 Tatooine 34 The main protagonist of the original trilogy.
2 Leia Organa Human Female 1.50 49 Brown Brown Light 19 Alderaan 35 A leader in the Rebel Alliance and twin sister of Luke Skywalker.
3 Darth Vader Human Male 2.02 136 None Yellow Pale 41 Tatooine 4 The Sith Lord formerly known as Anakin Skywalker.
4 Yoda Yoda's species Male 0.66 17 White Brown Green 896 Unknown 4 A wise and powerful Jedi Master.
5 Han Solo Human Male 1.80 80 Brown Hazel Light 29 Corellia 34 A smuggler turned hero in the Rebel Alliance.
6 Chewbacca Wookiee Male 2.28 112 Brown Blue Brown 200 Kashyyyk
A loyal Wookiee and co-pilot of the Millennium Falcon.
7 Obi-Wan Kenobi Human Male 1.82 81 White Blue-gray Light 57 Stewjon 0 A Jedi Master who mentored Anakin and Luke Skywalker.
8 Emperor Palpatine Human Male 1.73 75 None Yellow Pale 82 Naboo 35 The Sith Lord who became Emperor of the galaxy.
9 Anakin Skywalker Human Male 1.88 84 Blond Blue Light 41 Tatooine 4 A Jedi Knight who fell to the dark side and became Darth Vader.
10 Padmé Amidala Human Female 1.65 45 Brown Brown Light 46 Naboo 19 The queen-turned-senator of Naboo and wife of Anakin Skywalker.

Films

id title release_date director producer opening_crawl sl_opinion ck_opinion
1 Episode I: The Phantom Menace 36299 George Lucas Rick McCallum Turmoil has engulfed the Galactic Republic... good
2 Episode II: Attack of the Clones 37392 George Lucas Rick McCallum There is unrest in the Galactic Senate... good
3 Episode III: Revenge of the Sith 38491 George Lucas Rick McCallum War! The Republic is crumbling under attacks... good
4 Episode IV: A New Hope 28270 George Lucas Gary Kurtz, Rick McCallum It is a period of civil war... good
5 Episode V: The Empire Strikes Back 29362 Irvin Kershner Gary Kurtz, Rick McCallum It is a dark time for the Rebellion... good
6 Episode VI: Return of the Jedi 30461 Richard Marquand Howard G. Kazanjian, George Lucas, Rick McCallum Luke Skywalker has returned to his home planet... good
7 Episode VII: The Force Awakens 42356 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Bryan Burk Luke Skywalker has vanished...

8 Episode VIII: The Last Jedi 43084 Rian Johnson Kathleen Kennedy, Ram Bergman The FIRST ORDER reigns... good
9 Episode IX: The Rise of Skywalker 43819 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Michelle Rejwan The dead speak!...

10 Rogue One: A Star Wars Story 42720 Gareth Edwards Kathleen Kennedy, Allison Shearmur, Simon Emanuel The Rebellion makes a risky move... good
11 Solo: A Star Wars Story 43245 Ron Howard Kathleen Kennedy, Allison Shearmur, Simon Emanuel Board the Millennium Falcon and journey to a galaxy far, far away... good

Cleaning

clean_names()

Cleans names of an object (usually a data.frame)

names(species_messy)
 [1] "id"             "name"           "classification" "designation"   
 [5] "µ_height"       "skin_colors"    "hAiR_cOlOrS"    "eye_colors"    
 [9] "µ_lifespan"     "language"       "hOmEwOrLd"     
species_messy |> 
  clean_names(
    case = "snake", #default case
  ) |> 
  names()
Warning in warn_micro_mu(string = string, replace = replace): Watch out!  The mu or micro symbol is in the input string, and may have been converted to 'm' while 'u' may have been expected.  Consider adding the following to the `replace` argument:
The following characters are in the names to clean but are not replaced: \u00b5
 [1] "id"               "name"             "classification"   "designation"     
 [5] "m_height"         "skin_colors"      "h_ai_r_c_ol_or_s" "eye_colors"      
 [9] "m_lifespan"       "language"         "h_om_ew_or_ld"   

clean_names() continued

names(species_messy)
 [1] "id"             "name"           "classification" "designation"   
 [5] "µ_height"       "skin_colors"    "hAiR_cOlOrS"    "eye_colors"    
 [9] "µ_lifespan"     "language"       "hOmEwOrLd"     
species_messy |> 
  clean_names(
    case = "snake", #default case
    replace = c("hAiR" = "hair", "cOlOrS" = "colors", "hOmEwOrLd" = "homeworld", 
                janitor:::mu_to_u)
  ) |> 
  names()
 [1] "id"             "name"           "classification" "designation"   
 [5] "u_height"       "skin_colors"    "hair_colors"    "eye_colors"    
 [9] "u_lifespan"     "language"       "homeworld"     

make_clean_names()

Returns a clean character vector

species_messy |> 
  select(name) |> 
  mutate(
    NAME_CLEAN = make_clean_names(name)
  )
# A tibble: 40 × 2
   name           NAME_CLEAN   
   <chr>          <chr>        
 1 Human          human        
 2 Yoda's species yodas_species
 3 Wookiee        wookiee      
 4 Gungan         gungan       
 5 Twi'lek        twilek       
 6 Mon Calamari   mon_calamari 
 7 Rodian         rodian       
 8 Droid          droid        
 9 Trandoshan     trandoshan   
10 Kel Dor        kel_dor      
# ℹ 30 more rows

get_dupes()

Shows the duplicated rows according to a certain column (or multiple columns).

characters_messy |> 
  get_dupes(name) |> 
  select(name, dupe_count, species, height, weight, homeworld) |> 
  print(n = Inf)
# A tibble: 28 × 6
   name                 dupe_count species height weight homeworld
   <chr>                     <int> <chr>    <dbl>  <dbl> <chr>    
 1 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 2 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 3 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 4 Saw Gerrera                   3 Human     1.78     75 Onderon  
 5 Saw Gerrera                   3 Human     1.78     75 Onderon  
 6 Saw Gerrera                   3 Human     1.78     75 Onderon  
 7 Baze Malbus                   2 Human     1.85     95 Jedha    
 8 Baze Malbus                   2 Human     1.85     95 Jedha    
 9 Bodhi Rook                    2 Human     1.8      75 Jedha    
10 Bodhi Rook                    2 Human     1.8      75 Jedha    
11 Cassian Andor                 2 Human     1.84     75 Fest     
12 Cassian Andor                 2 Human     1.84     75 Fest     
13 Ezra Bridger                  2 Human     1.75     70 Lothal   
14 Ezra Bridger                  2 Human     1.75     70 Lothal   
15 Galen Erso                    2 Human     1.83     75 Grange   
16 Galen Erso                    2 Human     1.83     75 Grange   
17 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
18 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
19 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
20 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
21 Jyn Erso                      2 Human     1.7      60 Vallt    
22 Jyn Erso                      2 Human     1.7      60 Vallt    
23 Kanan Jarrus                  2 Human     1.91     86 Coruscant
24 Kanan Jarrus                  2 Human     1.91     86 Coruscant
25 Orson Krennic                 2 Human     1.85     80 Lexrul   
26 Orson Krennic                 2 Human     1.85     80 Lexrul   
27 Sabine Wren                   2 Human     1.65     50 Mandalore
28 Sabine Wren                   2 Human     1.65     50 Mandalore

convert_to_date()

  • Can handle mixed types (“2024-10-01” [character] vs 344402 [numeric])
  • Will handle Excel-specific date numbers (through excel_numeric_to_date())
films_messy |> 
  mutate(release_date_clean = convert_to_date(release_date),
         release_data_lub = lubridate::as_date(release_date)) |> 
  select(title, release_date, release_date_clean, release_data_lub)
# A tibble: 11 × 4
   title                        release_date release_date_clean release_data_lub
   <chr>                               <dbl> <date>             <date>          
 1 Episode I: The Phantom Mena…        36299 1999-05-19         2069-05-20      
 2 Episode II: Attack of the C…        37392 2002-05-16         2072-05-17      
 3 Episode III: Revenge of the…        38491 2005-05-19         2075-05-21      
 4 Episode IV: A New Hope              28270 1977-05-25         2047-05-27      
 5 Episode V: The Empire Strik…        29362 1980-05-21         2050-05-23      
 6 Episode VI: Return of the J…        30461 1983-05-25         2053-05-26      
 7 Episode VII: The Force Awak…        42356 2015-12-18         2085-12-19      
 8 Episode VIII: The Last Jedi         43084 2017-12-15         2087-12-17      
 9 Episode IX: The Rise of Sky…        43819 2019-12-20         2089-12-21      
10 Rogue One: A Star Wars Story        42720 2016-12-16         2086-12-18      
11 Solo: A Star Wars Story             43245 2018-05-25         2088-05-26      

remove_empty

films_messy
# A tibble: 11 × 8
      id title           release_date director producer opening_crawl sl_opinion
   <dbl> <chr>                  <dbl> <chr>    <chr>    <chr>         <chr>     
 1     1 Episode I: The…        36299 George … Rick Mc… Turmoil has … good      
 2     2 Episode II: At…        37392 George … Rick Mc… There is unr… good      
 3     3 Episode III: R…        38491 George … Rick Mc… War! The Rep… good      
 4     4 Episode IV: A …        28270 George … Gary Ku… It is a peri… good      
 5     5 Episode V: The…        29362 Irvin K… Gary Ku… It is a dark… good      
 6     6 Episode VI: Re…        30461 Richard… Howard … Luke Skywalk… good      
 7     7 Episode VII: T…        42356 J.J. Ab… Kathlee… Luke Skywalk… <NA>      
 8     8 Episode VIII: …        43084 Rian Jo… Kathlee… The FIRST OR… good      
 9     9 Episode IX: Th…        43819 J.J. Ab… Kathlee… The dead spe… <NA>      
10    10 Rogue One: A S…        42720 Gareth … Kathlee… The Rebellio… good      
11    11 Solo: A Star W…        43245 Ron How… Kathlee… Board the Mi… good      
# ℹ 1 more variable: ck_opinion <lgl>
films_messy |> 
  remove_empty("cols")
# A tibble: 11 × 7
      id title           release_date director producer opening_crawl sl_opinion
   <dbl> <chr>                  <dbl> <chr>    <chr>    <chr>         <chr>     
 1     1 Episode I: The…        36299 George … Rick Mc… Turmoil has … good      
 2     2 Episode II: At…        37392 George … Rick Mc… There is unr… good      
 3     3 Episode III: R…        38491 George … Rick Mc… War! The Rep… good      
 4     4 Episode IV: A …        28270 George … Gary Ku… It is a peri… good      
 5     5 Episode V: The…        29362 Irvin K… Gary Ku… It is a dark… good      
 6     6 Episode VI: Re…        30461 Richard… Howard … Luke Skywalk… good      
 7     7 Episode VII: T…        42356 J.J. Ab… Kathlee… Luke Skywalk… <NA>      
 8     8 Episode VIII: …        43084 Rian Jo… Kathlee… The FIRST OR… good      
 9     9 Episode IX: Th…        43819 J.J. Ab… Kathlee… The dead spe… <NA>      
10    10 Rogue One: A S…        42720 Gareth … Kathlee… The Rebellio… good      
11    11 Solo: A Star W…        43245 Ron How… Kathlee… Board the Mi… good      

remove_constant

Removes columns with one constant value

films_messy |> 
  remove_constant()
# A tibble: 11 × 7
      id title           release_date director producer opening_crawl sl_opinion
   <dbl> <chr>                  <dbl> <chr>    <chr>    <chr>         <chr>     
 1     1 Episode I: The…        36299 George … Rick Mc… Turmoil has … good      
 2     2 Episode II: At…        37392 George … Rick Mc… There is unr… good      
 3     3 Episode III: R…        38491 George … Rick Mc… War! The Rep… good      
 4     4 Episode IV: A …        28270 George … Gary Ku… It is a peri… good      
 5     5 Episode V: The…        29362 Irvin K… Gary Ku… It is a dark… good      
 6     6 Episode VI: Re…        30461 Richard… Howard … Luke Skywalk… good      
 7     7 Episode VII: T…        42356 J.J. Ab… Kathlee… Luke Skywalk… <NA>      
 8     8 Episode VIII: …        43084 Rian Jo… Kathlee… The FIRST OR… good      
 9     9 Episode IX: Th…        43819 J.J. Ab… Kathlee… The dead spe… <NA>      
10    10 Rogue One: A S…        42720 Gareth … Kathlee… The Rebellio… good      
11    11 Solo: A Star W…        43245 Ron How… Kathlee… Board the Mi… good      

remove_constant

Can also specify whether NA’s should be considered different from a constant value

films_messy |> 
  remove_constant(
    na.rm = TRUE # default is FALSE
  )
# A tibble: 11 × 6
      id title                      release_date director producer opening_crawl
   <dbl> <chr>                             <dbl> <chr>    <chr>    <chr>        
 1     1 Episode I: The Phantom Me…        36299 George … Rick Mc… Turmoil has …
 2     2 Episode II: Attack of the…        37392 George … Rick Mc… There is unr…
 3     3 Episode III: Revenge of t…        38491 George … Rick Mc… War! The Rep…
 4     4 Episode IV: A New Hope            28270 George … Gary Ku… It is a peri…
 5     5 Episode V: The Empire Str…        29362 Irvin K… Gary Ku… It is a dark…
 6     6 Episode VI: Return of the…        30461 Richard… Howard … Luke Skywalk…
 7     7 Episode VII: The Force Aw…        42356 J.J. Ab… Kathlee… Luke Skywalk…
 8     8 Episode VIII: The Last Je…        43084 Rian Jo… Kathlee… The FIRST OR…
 9     9 Episode IX: The Rise of S…        43819 J.J. Ab… Kathlee… The dead spe…
10    10 Rogue One: A Star Wars St…        42720 Gareth … Kathlee… The Rebellio…
11    11 Solo: A Star Wars Story           43245 Ron How… Kathlee… Board the Mi…

Tidying

functions

  • round_half_up [kaggle characters - height]
  • round_to_fraction [kaggle characters - height]
  • signif_half_up [kaggle characters - height]
  • compare_df_cols (checking) [split kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]
  • compare_df_cols_same (checking) [split kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]

tabyl()

tabyl() Basics

tabyl() is halfway between table() and gtsummary::cross_tab()

## one variable
characters_messy |> 
  tabyl(death_status)
 death_status  n percent
        alive 36   0.375
         dead 60   0.625


## two variables
characters_messy |> 
  tabyl(death_status, bmi)
 death_status BMI High BMI Low NA_
        alive       11      22   3
         dead       26      33   1

adorn_*()

Functions to make your “tabyls” more readable:

  • adorn_title() – add column name to the top of a two-way table
  • adorn_totals() – add row/column totals to a tabyl
  • adorn_percentages() – converts tabyl of counts to percentages


Functions that come after using adorn_percentages()

  • adorn_pct_formatting() – formatting decimals as percentages
  • adorn_ns() – add underlying Ns to a tabyl with percentages
  • adorn_rounding() – round numeric columns

Adorn with totals and titles

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_totals() |> 
  adorn_title()
                   bmi        
 death_status BMI High BMI Low
        alive       11      22
         dead       26      33
        Total       37      55

Adorn with percentages

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting()
 death_status BMI High BMI Low
        alive    33.3%   66.7%
         dead    44.1%   55.9%


characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting() |> 
  adorn_ns()
 death_status   BMI High    BMI Low
        alive 33.3% (11) 66.7% (22)
         dead 44.1% (26) 55.9% (33)

FYI

  • untabyl – when would we want to do this??????

Takeaways

Takeaways

MAY THE FORCE BE WITH YOU